Abstract. There are two major goals that must be addressed in a 
portable back end: a good sequence of instructions must be selected mak- 
ing full use of the capabilities of the machine, and it must be possible to 
orchestrate target-specific optimizations. A key to the first problem is the 
language MLRISC, intended in part to represent the simplest and most 
basic operations implementable in hardware. The importance of MLRISC 
is that it provides a common representation for expressing the instruc- 
tion set of any hardware platform. Bottom-up tree pattern matching with 
dynamic programming, expressed using succinct and clear specifications 
of the target instruction set, is used to generate target machine code 
from an MLRISC program. Target-specific optimizations are performed 
by parameterizing off-the-shelf optimization modules with concepts com- 
mon across architectures. The specification of a variety of architectures, 
and the ability to mix and match sophisticated optimization algorithms 
are shown. The resulting back end is independent of the intermediate 
language used in SML/NJ, and could in principle be used in a compiler 
for a source language quite different from SML. We argue that porting 
the compiler to a new architecture requires substantially less effort than 
the existing abstract machine approach, and report significant gains from 
preliminary architecture description driven optimizations. 


1 Introduction 

Portability is crucial to the widespread use and acceptance of any new language. 
Not only must the compiler be readily portable to a wide variety of architectures, 
but it must also generate code that is competitive with one where portability is 
not an issue. The compiler cannot be biased towards one architecture. 

The Standard ML of New Jersey system (SML/NJ)[3, 4] is a highly opti- 
mizing compiler that uses the Continuation Passing Style (CPS) intermediate 
form for optimization[2]. Most of the optimizations in the compiler are done at 
this level. The code generation model has been based on an abstract machine 
called the cmachine for code machine. The cmachine has a small set of registers, 
and a fairly high level instruction set. There is a cmachine instruction that can 
expand to several hundred instructions. Registers include: an allocation pointer 
representing the next available location in the heap, a limit register representing 
the highest address in the heap, a set of miscellaneous registers for parameter 
passing, and others. The compiler is ported to a new architecture by providing 
a mapping of the cmachine registers to physical registers, and templates that 
macro-expand cmachine instructions into target machine instructions. Such a 
port is unsatisfactory in several ways: useful low level optimizations are omitted 
in the translation to machine code; it may not be possible to use the full 
capabilities of the target architecture and its instruction set; and it is sometimes 
difficult to incorporate target-specific optimizations. The back end is 
responsible for linking, scheduling, span-dependency analysis, and binary instruction 
output. There is no dependence on the host assembler and linker. Various 
target-specific optimizations, such as scheduling, are manually implemented for each 
architecture. 

Two major problems must be addressed in a portable back end, namely: a 
good sequence of instructions must be selected for the task, making full use of 
the available registers and instructions, and it must be possible to orchestrate 
optimizations specific to the architecture, without compromising portability. 

2 Overview of Our Approach 

Our new approach is not biased towards any architecture. CPS is compiled to a 
tree language called MLRISC, intended in part to describe the simplest kinds 
of operations implementable in hardware. No assumptions are made regarding 
addressing modes or types of instructions, and because of our register 
allocation scheme, there are few assumptions made about physical registers. The 
MLRISC is then converted to a flow graph of target machine instructions, which 
is optimized using generic optimization modules parameterized over a machine 
description. 

The importance of MLRISC is that it provides a common medium for the 
specification of any instruction set. There are basic operations implementable 
in hardware, and instructions are made up of these operations. An instruction 
ought to be definable using these basic operations. 

Bottom-up tree pattern matching with dynamic programming (BURG) is 
central to our approach. The translation from CPS to target machine code pro- 
ceeds in three major phases, two of which involve BURG specifications (Figure 1). 
The high level CPS is first translated into a simpler form called ctrees, suitable 
as input to BURG. Several optimizations are performed during this simplifi- 
cation. Using the BURG specification W, the ctree language is rewritten to 
MLRISC trees, optimizing the tagging and untagging of arithmetic operations 
along the way. BURG is used once again to translate MLRISC to a flowgraph 
of target machine instructions. The specification is a description of the target 
architecture's instruction set and registers. Various optimizations such as liveness 
analysis, scheduling, span-dependency analysis, and graph-coloring register al- 
location are performed on the target machine instructions. The optimization 
phase is parameterized over a machine description represented as a set of SML 
modules. The back end is constructed by a series of functor applications and is 
a nice demonstration of the flexibility provided by the module system[15]. The 
concise description of the instruction set in terms of MLRISC, and the ability to 
perform architecture description driven optimizations, are the key contributions. 


Fig. 1. Flowchart of new code generation model 


3 ML-Burg 


The new code generation strategy is implemented using an SML version of 
iBurg[10, 13]. Given a tree rewriting system augmented with costs, ML-Burg generates a 
program to perform bottom-up tree pattern matching with dynamic programming. 
A successful reduction of the input tree corresponds to rewriting the 
input tree to a special non-terminal symbol called the start non-terminal. Upon 
successful reduction, facilities are provided to walk the tree emitting semantic 
actions associated with the rules that matched. 

Consider the rewrite system specified below: 


    reg : LI_i             (1) 
    reg : ADD(reg,LI_i)    (1) 
    reg : ADD(reg,reg)     (1) 


ADD is a binary node with the usual meaning, and LI_i is a leaf node representing 
the integer immediate i. The integer i is not used in pattern matching and is 
not part of the rewrite rule, but it is an attribute that may be used in semantic 
actions. This grammar specifies that: an input tree matching LI_i can be reduced 
to the non-terminal reg with a cost of one; an input tree matching ADD(reg,LI_i), 
where the first child can be reduced to reg, can also be reduced to the 
non-terminal reg. The grammar above is clearly ambiguous as there are two ways to 
reduce the tree ADD(LI_3,LI_4) to the non-terminal reg. The two reductions are 
shown below, where each reduction is annotated with SPARC assembly code. 
The registers %t1, %t2, and %t3 are pseudo-registers that are assigned to physical 
registers in a later register allocation pass. 



    ADD(LI_3,LI_4)                       ADD(LI_3,LI_4) 
       ↓ ;; add %g0,3,%t1                   ↓ ;; add %g0,3,%t1 
    ADD(reg,LI_4)                        ADD(reg,LI_4) 
       ↓ ;; add %g0,4,%t2                   ↓ ;; add %t1,4,%t2 
    ADD(reg,reg)                         reg 
       ↓ ;; add %t1,%t2,%t3 
    reg 


It is precisely this ambiguity in specification that is the strength of tree rewriting 
code generation techniques. Dynamic programming finds the cheapest set of 
instructions to implement the program. The reduction on the left has a cost of 
three, while the one on the right has a cost of two. This example, however, does 
not demonstrate the ability to describe different register classes available on the 
target architecture, or non-regular register sets (Section 6). 
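
The cost comparison above can be reproduced with a few lines of SML. This is only a minimal sketch, assuming the three unit-cost rules shown earlier: the datatype tree and the function regCost are illustrative names, and ML-Burg itself generates a table-driven bottom-up labeller rather than the direct recursion used here.

```sml
(* Toy input trees: LI is an integer leaf, ADD a binary node. *)
datatype tree = LI of int | ADD of tree * tree

(* Cheapest cost of reducing a tree to the non-terminal reg under
   the three rules above, each of which has cost one. *)
fun regCost (LI _) = 1                          (* reg : LI_i *)
  | regCost (ADD (l, r)) =
      let
        (* reg : ADD(reg,reg) always applies. *)
        val viaRegReg = regCost l + regCost r + 1
      in
        case r of
          (* reg : ADD(reg,LI_i) also applies when the right child is a leaf. *)
          LI _ => Int.min (viaRegReg, regCost l + 1)
        | _ => viaRegReg
      end

val cost = regCost (ADD (LI 3, LI 4))   (* evaluates to 2 *)
```

For ADD(LI_3,LI_4) the rule ADD(reg,reg) costs three and the rule ADD(reg,LI_i) costs two, so the sketch selects the reduction on the right.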


4 MLRISC 

Figure 2 shows the SML signature for MLRISC. The instruction set, described 
by the datatype mlrisc, makes no assumptions about addressing modes on the 
target machine. It is possible to JMP to anything, and LOAD/STORE from anything. 
Each mlrisc instruction defines a basic combinator that will be used to describe 
the target instruction set. A BURG grammar is used to define the instruction 
set, and the associated semantic actions can be used to effectively utilize the 
hardware. 

Non-commutative arithmetic operations specify the order of evaluation of 
arguments, using the type order. The order of evaluation must be recorded to 
preserve the semantics with respect to arithmetic exceptions. Thus instructions 
like SUB and DIV, etc., specify the order of evaluation. The order is assumed to 
be left-to-right for commutative operators. 

There is a commitment to general purpose and floating point registers, REG 
and FREG respectively. Nearly all processors today have these sets of registers 
and provide dedicated instructions to operate on them. This does not preclude 
the Motorola 68000, which does not have general purpose registers: BURG 
non-terminals may be used to represent the Motorola 68000 address and data 
registers (Section 6). 

Lastly, MLRISC has no connection to the CPS intermediate representation 
or dedicated registers, and can be easily divorced from the SML/NJ system. 

5 Ctree and MLRISC Generation 

The ctree representation is used to simplify the high level semantics of CPS, 
and provide a suitable tree representation for input to BURG. Generating a 
tree representation from the linear CPS input must observe the semantics with 
respect to arithmetic exceptions and memory. Several low level optimizations 
may be performed on the ctree representation, such as: 



    structure Label : sig 
      datatype label = ... 
    end = 
    struct ... end 

    signature MLRISC = sig 
      datatype order = LR | RL                 (* order of evaluation *) 
      datatype bcond = LT | LE | EQ | GEU      (* branch conditions *) 
      datatype mlrisc                          (* instructions *) 
        = REG of int                           (* register *) 
        | FREG of int                          (* floating register *) 
        | LI of int                            (* integer constant *) 
        | MV of mlrisc * mlrisc                (* move *) 
        | FMV of mlrisc * mlrisc               (* floating point move *) 
        | ADD of mlrisc * mlrisc               (* addition *) 
        | SUB of mlrisc * mlrisc * order       (* subtraction *) 
        | ANDB of mlrisc * mlrisc              (* logical AND *) 
                                               (* memory operations *) 
        | LOAD of mlrisc 
        | STORE of mlrisc * mlrisc 
        | CVTI2D of mlrisc                     (* convert integer to double *) 
        | FADDD of mlrisc * mlrisc             (* floating point addition *) 
                                               (* branch instructions *) 
        | BR of Label.label 
        | JMP of mlrisc 
        | BCC of bcond * mlrisc * mlrisc * Label.label * order 
        | FBCC of bcond * mlrisc * mlrisc * Label.label * order 
        | SEQ of mlrisc * mlrisc               (* sequencing *) 
    end 


Fig. 2. MLRISC specification 


— The detection of situations where a record creation in the heap can be im- 
plemented as a tight loop copying consecutive locations from one memory 
area to the record being created. 

— Propagating increments to the allocation pointer, so that it is performed 
only once at the function exit points. 

Dynamic programming is used to optimize the tagging and untagging of 
arithmetic expressions in the translation of ctrees to MLRISC. Integers in SML 
are tagged with their lowest bit set to one (i.e., the integer n is represented 
as 2n + 1). On the MIPS, the old code generator expands the CPS program 

(x:=a-b; z:=x+y) to: 



    sub  a,b,t1       % t1 := a - b 
    add  t1,1,x       % x  := t1 + 1 
    sub  x,1,t2       % t2 := x - 1 
    add  t2,y,z       % z  := t2 + y 


Clearly the intermediate tagging and untagging is unnecessary. Dynamic 
programming using BURG is a fast and elegant solution to this problem. Peterson's 
min-cut algorithm is more thorough but expensive (O(n^3))[16]. The BURG 
specification W contains 124 rules (details appear in an extended version[11]). 
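
The arithmetic behind this saving can be checked directly. The sketch below assumes only the 2n + 1 representation stated above; the function names (tag, untag, subTagged, addTagged, naive, fused) are illustrative and do not come from the compiler, and overflow checking is ignored.

```sml
(* Tagged representation assumed from the text: n is stored as 2n + 1. *)
fun tag n = 2 * n + 1
fun untag t = (t - 1) div 2

(* Naive expansion: every operation re-tags its result. *)
fun subTagged (a, b) = (a - b) + 1        (* sub a,b,t1 ; add t1,1,x *)
fun addTagged (x, y) = (x - 1) + y        (* sub x,1,t2 ; add t2,y,z *)
fun naive (a, b, y) = addTagged (subTagged (a, b), y)   (* 4 instructions *)

(* With the intermediate +1/-1 pair cancelled. *)
fun fused (a, b, y) = (a - b) + y         (* 2 instructions *)

val a = tag 10 and b = tag 3 and y = tag 5
val z1 = naive (a, b, y)
val z2 = fused (a, b, y)
```

Both versions compute the tagged form of 10 - 3 + 5, since ((a - b) + 1) - 1 + y = (a - b) + y for any tagged a, b and y.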

The resulting MLRISC program represents the simplest set of operations 
required to implement the CPS program. The burden of various optimizations till 
this point, in the abstract machine model, would have been on the person porting 
the compiler. These optimizations would be repeated for each architecture. Now, 
it has been transferred once and for all, to a person that is an expert on the 
internals of the compiler. These basic operations must now be combined to match 
instructions on the target machine. 

6 Instruction Set Specification 

As a concrete example, Figure 3 introduces a fragment of the SPARC specifica- 
tion. An effective address on the SPARC can either be a register + displacement 
or a register+register. This is specified using the non-terminal ea. The semantic 
actions associated with rules that reduce to ea, return a value of type eaValue. 
The operand to a LOAD must be reduced to the non-terminal ea, and the code 
to emit is a simple case statement over the various eaValue constructors. For- 
tunately, no restrictions were imposed on the operand to LOAD in the MLRISC 
design. This example extends to handle the full set of addressing modes and 
instructions found on CISC machines such as the Intel i486 or Motorola 68000. 
A description of the i486 addressing modes involves just 10 lines of BURG spec- 
ification. 

An example from the Motorola 68000 illustrates how simple specifications 
can later yield high quality code, and the use of non-terminals to denote 
various kinds of register classes. On the 68000, certain kinds of registers are 
not permitted as operands to instructions. For example, the operand to LOAD 
must be reducible to an address register. The result of the load may be either an 
address or data register. This is fairly easy to specify by devoting a non-terminal 
to address registers. A possible fragment of the 68000 specification is shown in 
Figure 4. 

For correctness, a movl is required in the implementation of ADD. Since we 
assume an infinite number of registers, which are later assigned to physical 
registers, these moves normally turn out to be harmless. Coalescing non-interfering 
live ranges in a graph-coloring register allocation algorithm[9] collapses rd and 
dreg_1 to the same physical register where possible, eliminating the redundant 
move. This technique is used quite effectively to handle the non-regular register 
set on the Intel i486. These specifications and semantic actions are very simple, 
yet they describe quite varied and complex systems. 



    datatype eaValue = DISPea of register * int 
                     | INDXea of register * register 

    ea:  ADD(reg,LI_i)       (0)  DISPea(reg,i);; 
    ea:  ADD(reg_1,reg_2)    (0)  INDXea(reg_1,reg_2);; 
    ea:  SUB(reg,LI_i)       (0)  DISPea(reg,~i);; 
    ea:  reg                 (0)  DISPea(reg,0);; 
    reg: LOAD(ea)            (1) 
           let val rd : register = newReg() 
           in 
             case ea 
              of DISPea(rt,n)  => emit(ld(rt, IMMED n, rd)) 
               | INDXea(rs,rt) => emit(ld(rs, REG rt, rd)) 
             (* esac *); 
             rd 
           end;; 


Fig. 3. SPARC instruction set specification 


    areg: LOAD(areg)          (1) ... 
    dreg: LOAD(areg)          (1) ... 
    dreg: ADD(dreg_1,dreg_2)  (1) let val rd = newDreg() 
                                  in 
                                    emit(movl(rd,dreg_1)); 
                                    emit(addl(rd,dreg_2)) 
                                  end 


Fig. 4. Motorola 68000 instruction set specification 


The combination of ML-Burg and MLRISC is an elegant way to solve the 
instruction selection problem. BURG is expressive enough to allow the 
concise specification of most instruction sets. A similar observation was reported by 
Appel[5], who wrote TWIG[1] specifications for the VAX and Motorola 68000; 
detailed information was encoded in the cost function to aid in the selection 
of the best rule. Porting the compiler does not require knowledge of any com- 
piler internals, such as tagging schemes, runtime representations, and semantics 
of high level abstract instructions (often specific to SML). The instruction set 
must be specifiable in MLRISC, which is then used to pick the cheapest instruc- 
tions to emit with respect to the cost function. Since the generated MLRISC 
data structure is larger than the source CPS, the instruction selection is done in 
small units. 



7 Target Machine Architectural Description 


Once instruction selection has been performed, facilities exist to generate a 
generic control flowgraph, where the nodes contain target machine instructions. 
It is not possible to directly output the binary representation of instructions as 
they contain pseudo-registers and symbolic labels. Instruction scheduling, span- 
dependency resolution, and further optimizations, may be necessary before final 
binary code emission. Writing target-specific optimizations for each architecture 
would be a portability nightmare. Instead, we use a scheme where off-the-shelf 
optimization modules are parameterized over a description of the target machine. 

While all machines are different in detail, they are all very similar in concept. 
The idea behind our machine description is to describe those concepts that are 
common across architectures, and use them in generic optimization modules. 
The structure of the machine description is shown in Figure 5. At the lowest 
level of the module dependency is a description of the storage units on the 
machine, specified by the signature CELLS. Several dataflow problems require 
efficient operations over sets of cells, so we require the type cellset and the 
usual set operations over them. These are easily constructed using modules de- 
fined in the SML/NJ Library [6]. The signature INSTRUCTION is a specification 
of the available instructions on the machine in terms of its cells. This hierar- 
chy corresponds to the fundamental design of von Neumann machines. Lastly, 
the signature INSN_PROPERTIES contains the bulk of the machine description. 
Useful properties of the instruction set are collected here, and used in generic 
optimization modules. For example: the type kind, returned by the function 
instrKind, is used to classify instructions as being either a NOP (IK_NOP), 
a jump instruction (IK_JUMP), or any other (IK_INSTR); the type target, 
returned by branchTargets, is used to describe the target of branch instructions. 
instrKind and branchTargets are used to implement a generic module that 
produces a flowgraph specialized over instructions of the target machine. 
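
To make the role of instrKind concrete, the following sketch shows a functor that splits an instruction list into basic blocks using nothing but the instruction classification. It is a simplified illustration, not the FlowGraph functor of the actual back end: the signature KINDS, the functor Blocks and the structure Toy are hypothetical names, and the sketch ignores labels and the control-flow edges that branchTargets would supply.

```sml
(* Minimal stand-in for the part of INSN_PROPERTIES used here. *)
signature KINDS = sig
  type instruction
  datatype kind = IK_NOP | IK_JUMP | IK_INSTR
  val instrKind : instruction -> kind
end

(* Generic module: split a straight-line instruction list into basic
   blocks, ending a block after every jump. *)
functor Blocks (P : KINDS) = struct
  fun split instrs =
    let
      fun loop ([], [], blocks) = rev blocks
        | loop ([], cur, blocks) = rev (rev cur :: blocks)
        | loop (i :: rest, cur, blocks) =
            (case P.instrKind i of
               P.IK_JUMP => loop (rest, [], rev (i :: cur) :: blocks)
             | _ => loop (rest, i :: cur, blocks))
    in
      loop (instrs, [], [])
    end
end

(* Hypothetical toy target used only to exercise the functor. *)
structure Toy = struct
  datatype instruction = Nop | Add | Jump of string
  datatype kind = IK_NOP | IK_JUMP | IK_INSTR
  fun instrKind Nop = IK_NOP
    | instrKind (Jump _) = IK_JUMP
    | instrKind Add = IK_INSTR
end

structure ToyBlocks = Blocks (Toy)
val blocks = ToyBlocks.split [Toy.Add, Toy.Jump "L1", Toy.Nop, Toy.Add]
```

The functor body never mentions a concrete instruction, so the same code could in principle be instantiated with any structure that provides instrKind.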

Figure 6 shows the machine description for the SPARC. The type cell in- 
cludes: an unlimited supply of general and floating point registers (Reg and 
Freg, respectively), the condition code register (CC), and the floating point con- 
dition code register (FCC). The stack (STACK) and memory (MEM), which are not 
normally considered to be in the same category as registers, are also included. 
Instructions that access the memory or stack will be marked as accessing the 
MEM or STACK resource. This information is used during instruction scheduling. 
SparcInstr is the module matching INSTRUCTION in the machine description. 
The type operand has been simplified for expository purposes. SparcProps shows 
a fragment of the module matching INSN_PROPERTIES. The module is a total of 
460 lines, most of which is boiler-plate. 

As additional optimization modules are developed, one may expect the sig- 
nature for INSN_PROPERTIES to grow, in order to meet the demands for more 
information about the target architecture. After a certain point in this evolu- 
tion, generating high quality code for a new architecture will involve mixing and 
matching off-the-shelf optimization modules to suit the architecture. 



    signature CELLS = sig 
      type cell 
      type cellset 
      val cardinality : cellset -> int 
      val union       : cellset * cellset -> cellset 
      val add         : cell * cellset -> cellset 
    end 

    signature INSTRUCTION = sig 
      structure C : CELLS 
      type instruction 
    end 

    signature INSN_PROPERTIES = sig 
      structure I : INSTRUCTION 
      structure C : CELLS 
      sharing I.C = C 
      datatype kind = IK_NOP | IK_JUMP | IK_INSTR 
      datatype target = LABELED of Label.label | FALLTHROUGH | ESCAPES 
      val instrKind     : I.instruction -> kind 
      val defUse        : I.instruction -> C.cellset * C.cellset 
      val branchTargets : I.instruction -> target list 
    end 


Fig. 5. Machine description 


8 Target Machine Optimization 

A generic basic block scheduler that is parameterized by the machine description 
described above has been developed. Figure 7 shows the machine properties 
required for this purpose. We describe each component individually in more 
detail to illustrate their complexity (or, more appropriately, the lack of it): 

branchDelayedArch is a boolean flag that indicates if the architecture requires a 
branch delay slot. Special considerations are used for picking this instruction 
if needed. 

latency(instr) is a function that returns the number of cycles needed to execute 
the instruction instr. 

needsNop(instr, instrs): during scheduling there may not be enough 
instructions available to keep the pipeline busy while executing high latency 
instructions. Further, some architectures require an explicit NOP (No OPeration) 
instruction between two instructions under such circumstances. For example, 
on the MIPS, a MFHI instruction must occur at least two instructions after a 



    structure SparcCells = struct 
      structure S = SortedList 
      datatype cell = Reg of int 
                    | Freg of int 
                    | CC | FCC | STACK | MEM 
      type cellset = int list * int list * int list 
      fun cardinality(r,f,e) = length r + length f + length e 
      fun union((r1,f1,e1), (r2,f2,e2)) = 
        (S.merge(r1,r2), S.merge(f1,f2), S.merge(e1,e2)) 
    end 

    structure SparcInstr = struct 
      structure C = SparcCells 
      datatype operand = REGrand of int 
                       | IMrand of int 
                       | LABrand of Label.label 
      datatype cond_code = CC_A | CC_E | CC_NE | CC_G | CC_GE 
                         | CC_L | CC_LE | CC_GEU | CC_LEU 
      datatype instruction 
        = NOP 
        | LD of int * operand * int 
        | ADD of int * operand * int 
        | ADDCC of int * operand * int 
        | JMPL of int * Label.label list 
        | BCC of cond_code * Label.label 
        | FBCC of cond_code * Label.label 
    end 

    structure SparcProps = struct 
      structure I = SparcInstr 
      structure C = SparcCells 
      datatype kind = IK_NOP | IK_JUMP | IK_INSTR 
      datatype target = LABELED of Label.label | FALLTHROUGH | ESCAPES 
      fun instrKind(I.NOP)    = IK_NOP 
        | instrKind(I.BCC _)  = IK_JUMP 
        | instrKind(I.JMPL _) = IK_JUMP 
        | instrKind(I.FBCC _) = IK_JUMP 
        | instrKind _         = IK_INSTR 
      fun branchTargets(I.BCC(I.CC_A, lab)) = [LABELED lab] 
        | branchTargets(I.BCC(_, lab))      = [LABELED lab, FALLTHROUGH] 
    end 


Fig. 6. SPARC machine description 



    val branchDelayedArch : bool 
    val latency   : I.instruction -> int 
    val needsNop  : I.instruction * I.instruction list -> int 
    val defUse    : I.instruction -> int list * int list 
    val isSdi     : I.instruction -> bool 
    val minSize   : I.instruction -> int 
    val maxSize   : I.instruction -> int 
    val sdiSize   : I.instruction * (int -> int) * int -> int 
    val expand    : I.instruction * int * (int -> int) -> I.instruction list 


Fig. 7. Machine properties for basic block scheduling 


MULT instruction. needsNop returns the number of NOPs required between 
instr (the instruction being emitted), and instrs (the previous instructions 
emitted). 

defUse(instr) returns the lists of resources defined and used by the instruction. 
This is used to construct the data dependency graph. 

isSdi(instr) returns true if instr is a span-dependent instruction whose size is 
determined by the final value of labels. 

minSize/maxSize(instr) returns the minimum/maximum size of the 
instruction instr. These two functions are used to schedule blocks with 
span-dependent instructions. The value of labels is calculated assuming all 
instructions expand to their minimum size. Another calculation is performed 
assuming all instructions expand to their maximum size. If the size of a 
span-dependent instruction does not vary under these extremities, then it may 
be expanded, and scheduled along with the other instructions in that block. 
Such a block is said to be stable. Scheduling a basic block refines the value 
of labels under these two extremities and may stabilize an otherwise unstable 
block. If unstable blocks still persist, then there is no option but to expand 
the span-dependent instructions to their maximum size. 

sdiSize(instr, labMap, loc) returns the size of the span-dependent instruction 
instr, under the assignment of labels given by labMap, where the current 
location counter is loc. 

expand(instr, size, labMap) returns the sequence of instructions when the 
span-dependent instruction instr is expanded to size number of instructions, 
assuming the assignment of labels given by labMap. 
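
The stability test described under minSize/maxSize can be sketched as follows. Everything here is a toy under stated assumptions: the datatype instr, the reach constant near, and the association-list label map are hypothetical, and the label map is keyed by label names rather than having the int -> int type shown in Figure 7.

```sml
(* Hypothetical toy instructions; sizes are in words. *)
datatype instr
  = Plain                    (* fixed size: one word *)
  | Label of string          (* occupies no space *)
  | Branch of string         (* span-dependent: 1 word if near, else 2 *)

val near = 4   (* assumed reach of the short branch form, in words *)

fun minSize (Branch _) = 1 | minSize (Label _) = 0 | minSize Plain = 1
fun maxSize (Branch _) = 2 | maxSize (Label _) = 0 | maxSize Plain = 1

(* Address of every label when each instruction has the size sizeOf gives. *)
fun labelMap sizeOf instrs =
  let
    fun walk ([], _, acc) = acc
      | walk (Label l :: rest, loc, acc) = walk (rest, loc, (l, loc) :: acc)
      | walk (i :: rest, loc, acc) = walk (rest, loc + sizeOf i, acc)
  in
    walk (instrs, 0, [])
  end

fun lookup (m, l) =
  case List.find (fn (k, _) => k = l) m of
    SOME (_, loc) => loc
  | NONE => raise Fail ("undefined label " ^ l)

(* Size of an instruction at location loc under the label map m. *)
fun sdiSize (Branch l, m, loc) =
      if abs (lookup (m, l) - loc) <= near then 1 else 2
  | sdiSize (i, _, _) = minSize i

(* A block is stable if every instruction has the same size under the
   all-minimum and the all-maximum label assignments. *)
fun stable instrs =
  let
    val lo = labelMap minSize instrs
    val hi = labelMap maxSize instrs
    fun check ([], _, _) = true
      | check (i :: rest, locLo, locHi) =
          sdiSize (i, lo, locLo) = sdiSize (i, hi, locHi)
          andalso check (rest, locLo + minSize i, locHi + maxSize i)
  in
    check (instrs, 0, 0)
  end

val ok  = stable [Branch "L", Plain, Label "L"]
val bad = stable [Branch "L", Branch "L", Plain, Plain, Label "L"]
```

In the second block the label lies four words away under the minimum sizes and six under the maximum sizes, so the first branch changes size between the two extremes and the block is reported as unstable.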

The generic basic block scheduler is 397 lines of SML code. The module to 
perform span-dependency analysis is 384 lines. In a similar fashion to basic block 
scheduling, we have developed a generic graph-coloring register allocator used 
to allocate general purpose and floating point registers on most target machines. 
In addition, on the IBM RS/6000 it is used to allocate pseudo condition code 
registers among the eight condition code registers available. More optimizations 
are planned in the near future. 



9 Mix and Match 


Figure 8 shows the construction of the SPARC code generator, which is formed 
by linking several optimization phases. The FlowGraph functor produces a flow- 
graph data structure specialized over the SPARC instructions. The Liveness 
functor exports a function called liveness, which will annotate the flowgraph 
with liveness information at block boundaries. Optimizations are mixed and 
matched using functor applications. The functors RegAllocator and FlowGraphGen 
each implement a certain optimization, and require a function codegen that will be 
invoked to perform the rest of the optimizations. The functor BBSched, which 
performs basic block scheduling, is the last in the chain, and exports a function 
called finish that does the final machine code output. The SPARC code 
generator, in Figure 8, strings together: flowgraph generation that includes liveness 
analysis (FlowGen); integer register allocation (IntRAlloc); floating point 
register allocation (FloatRAlloc); and finally basic block scheduling (BBsched). The 
various parameters to these functors are unimportant except to note that they 
are specified by signatures that describe generic properties of architectures. The 
example illustrates that the use of functor application makes it easy to mix and 
match generic optimization modules to suit the SPARC architecture. 
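
The chaining idiom can be isolated in a small runnable sketch. The names PHASE, Phase, Sched, FloatRA, IntRA and Gen are hypothetical, and a program is represented merely as the list of phase names that have processed it; the point is only that each functor application receives the next phase as its codegen parameter.

```sml
(* Each phase records its name and hands the program to the next phase
   through the function parameter codegen. *)
signature PHASE = sig
  val run : string list -> string list
end

functor Phase (val name : string
               val codegen : string list -> string list) : PHASE =
struct
  fun run prog = codegen (prog @ [name])
end

(* Last phase in the chain: nothing follows it. *)
structure Sched : PHASE = struct
  fun run prog = prog @ ["bbsched"]
end

structure FloatRA = Phase (val name = "floatRA" val codegen = Sched.run)
structure IntRA   = Phase (val name = "intRA"   val codegen = FloatRA.run)
structure Gen     = Phase (val name = "flowgen" val codegen = IntRA.run)

val trace = Gen.run []
```

Reordering or omitting a phase amounts to changing which run function is passed as codegen, with no change to the phase bodies.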


    structure SparcFlow = FlowGraph(structure Instr = SparcInstr) 

    structure SparcLive = Liveness(structure Flowgraph = SparcFlow 
                                   structure InsnProps = SparcProps) 

    structure BBsched = BBSched(structure Flowgraph = SparcFlow 
                                structure InsnProps = SparcProps 
                                structure Emitter = SparcMCEmitter) 

    structure FloatRAlloc = RegAllocator(structure Ra = FloatRA_Arg 
                                         val codegen = BBsched.bbsched) 

    structure IntRAlloc = RegAllocator(structure Ra = IntRA_Arg 
                                       val codegen = FloatRAlloc.ra) 

    structure FlowGen = FlowGraphGen(structure Flowgraph = SparcFlow 
                                     structure InsnProps = SparcProps 
                                     val codegen = SparcLive.liveness) 


Fig. 8. Gluing the SPARC code generator together 



10 Results 


At the time of writing, we have working code generators for the MIPS, IBM 
RS/6000 and SPARC, and an untested specification for the Intel i486. The 
preliminary results reported here are only for the SPARC. 

A fairly standard set of SML benchmarks is used[2]. We first measure 
the improvements from using a more sophisticated register allocation scheme. 
SML/NJ supports a register passing style for parameters, and it is essential that 
operands be computed in the right register. Register constraints may require that 
the operand be first computed into a temporary and later moved into the correct 
register before a function call. The first column of Figure 9 shows the number 
of register-register moves required at function call boundaries in the existing 
compiler (version 0.93). The second column shows the performance of our new 
graph-coloring register allocator. At least 40% of the original register-register 
moves are removed. 

Figure 10 shows the improvements from dynamic programming and the 
allocation pointer optimization described in Section 5. The first column shows the 
static code size (in number of instructions) without any optimization, and the 
second with these optimizations. There is a static code size improvement of 2- 
5%. In terms of dynamic instruction counts, this corresponds to roughly 1-3% 
improvement. This is encouraging as these improvements have come largely for 
free in our attempt to improve portability. Machines such as the DEC Alpha or 
the IBM RS/6000 should do even better, because multiple overflow checks may 
be collapsed into one. 

While compile time speeds are acceptable, we do not report them since the 
new system is not tuned or optimized for this. The back end in the current 
SML/NJ compiler takes about 25% of the total compilation time. This percent- 
age does not include CPS optimization. The new back end is currently about 
3-4 times slower. 


11 Future Work 


Machine descriptions are required for all the architectures that the SML/NJ 
compiler currently supports, which include the Motorola 68000 and the HPPA. 
Work is in progress on a DEC Alpha port. The main areas for future work 
relate to the speed of compilation, and further optimizations relevant to RISC 
processors. Our compilation scheme is highly symbolic; developing fast table 
driven optimizations[17] derived from a more concise machine description, and 
the use of partial evaluation[7], ought to produce a faster back end. Composing 
BURG specifications, similar to what is done for attribute grammars, may also prove 
worthwhile[8]. Lastly, a pre-pass global scheduler is extremely important for 
superpipelined and superscalar machines[12, 14]. 

Fig. 9. Register-register moves with 6 callee-save registers 

Fig. 10. Static code size and dynamic instruction count improvements 


12 Conclusions 

A highly portable and optimizing back end has been described. It addresses the 
problems of target machine instruction selection and machine specific 
optimization. Porting the compiler to a new architectural platform is expected to be 
trivial for someone ignorant of the internals of the compiler, but familiar with 
the architecture. Off-the-shelf optimization modules can be easily combined to 
suit a particular machine. The set of optimizations currently implemented shows 
encouraging results. 